Abstract

The task of parametric model selection is cast in terms of a statistical me- 
chanics on the space of probability distributions. Using the techniques of low- 
temperature expansions, we arrive at a systematic series for the Bayesian pos- 
terior probability of a model family that significantly extends known results 
in the literature. In particular, we arrive at a precise understanding of how 
Occam's Razor, the principle that simpler models should be preferred until the 
data justifies more complex models, is automatically embodied by probabil- 
ity theory. These results require a measure on the space of model parameters 
and we derive and discuss an interpretation of Jeffreys' prior distribution as a 
uniform prior over the distributions indexed by a family. Finally, we derive a 
theoretical index of the complexity of a parametric family relative to some true 
distribution that we call the razor of the model. The form of the razor imme- 
diately suggests several interesting questions in the theory of learning that can 
be studied using the techniques of statistical mechanics. 



1 Introduction 

In recent years increasingly precise experiments have directed the interest of biophysi- 
cists towards learning in simple neural systems. The typical context of such learning 
involves estimation of some behaviourally relevant information from a statistically 
varying environment. For example, the experiments of de Ruyter and collaborators 
have provided detailed measurements of the adaptive encoding of wide field horizontal
motion by the H1 neuron of the blowfly ([?]). Under many circumstances the
associated problems of statistical estimation can be fruitfully cast in the language of
statistical mechanics, and the powerful techniques developed in that discipline can be
brought to bear on questions regarding learning ([17]).

In this paper we are concerned with a problem that arises frequently in the context 
of biophysical and computational learning - the estimation of parametric models of 
some true distribution t based on a collection of data drawn from t. If we are given a 
particular family of parametric models (Gaussians, for example) the task of modelling 
t is reduced to parameter estimation, which is a relatively well-understood, though 
difficult, problem. Much less is known about the task of model family selection - 
for example, how do we choose between a family of Gaussians and a family of fifty 
exponentials as a model for t based on the available data? In this paper we will
be concerned with the latter problem, on which considerable ink has already been
expended in the literature ([?]).

The first contribution of this paper is to cast Bayesian model family selection more 
clearly as a statistical mechanics on the space of probability distributions in the hope 
of making this important problem more accessible to physicists. In this language, a 
finite dimensional parametric model family is viewed as a manifold embedded in the 
space of probability distributions. The probability of the model family given the data 
can be identified with a partition function associated with a particular energy func- 
tional. The formalism bears a resemblance to the description of a disordered system 
in which the number of data points plays the role of the inverse temperature and in 
which the data plays the role of the disordering medium. Exploiting the techniques 
of low temperature expansions in statistical mechanics it is easy to extend existing 
results that use Gaussian approximations to the Bayesian posterior probability of a 
model family to find "Occam factors" penalizing complex models ([?], [15], [?]). We
find a systematic expansion in powers of $1/N$ where $N$ is the number of data points
and identify terms that encode accuracy, model dimensionality and robustness as 
well as higher order measures of simplicity. The subleading terms can be important 
when the number of data points is small and represent a limited attempt to move 
analysis of Bayesian statistics away from asymptotics towards the regime of small N 
that is often biologically relevant. The results presented here do not require the true 
distribution to be a member of the parametric family under consideration and the 
model degeneracies that can threaten analysis in such cases are dealt with by the 
method of collective coordinates from statistical mechanics. Some connections with 
the Minimum Description Length principle and stochastic complexity are discussed
([?]).



In order to perform Bayesian model selection it is necessary to have a prior dis- 
tribution on the space of parameters of a model. Equivalently, we require the correct 
measure on the phase space defined by the parameter manifold in the analogue sta- 
tistical mechanical problem considered in this paper. In the absence of well-founded 
reasons to pick a particular prior distribution, the usual prescription is to pick an 
unbiased prior density that weights all parameters equally. However, this prescrip- 
tion is not invariant under reparametrization and we will argue that the correct prior 
should give equal weight to all distributions indexed by the parameters. Requiring 
all distributions to be a priori equally likely yields Jeffreys' prior on the parameter 
manifold, giving a new interpretation of this choice of prior density ([12]).

Finally, consideration of the large $N$ limit of the asymptotic expansion of the
Bayesian posterior probability leads us to define the razor of a model, a theoretical 
index of the complexity of a parametric family relative to a true distribution. In 
statistical mechanical terms, the razor is the quenched approximation of the disor- 
dered system studied in Bayesian statistics. Analysis of the razor using the techniques 
of statistical mechanics can give insights into the types of phenomena that can be 
expected in systems that perform Bayesian statistical inference. These phenomena 
include "phase transitions" in learning and adaptation to changing environments. In 
view of the length of this paper, applications of the general framework developed here 
to specific models relevant to biophysics will be left to future publications. 



2 Statistical Inference and Statistical Mechanics 

Suppose we are given a collection of outcomes $E = \{e_1, \ldots, e_N\}$, $e_i \in X$, drawn
independently from a density $t$. Suppose also that we are given two parametric families of
distributions $A$ and $B$ and we wish to pick one of them as the model family that we
will use. The Bayesian approach to this problem consists of computing the posterior
conditional probabilities $\Pr(A|E)$ and $\Pr(B|E)$ and picking the family with the higher
probability. Let $A$ be parametrized by a set of real parameters $\Theta = \{\theta^1, \ldots, \theta^d\}$. Then
Bayes' Rule tells us that:

$$\Pr(A|E) = \frac{\Pr(A)}{\Pr(E)} \int d^d\Theta \, w(\Theta) \, \Pr(E|\Theta) \tag{1}$$

In this expression $\Pr(A)$ is the prior probability of the model family, $w(\Theta)$ is a prior
density on the parameter space and $\Pr(E)$ is a prior density on the $N$ outcome
sample space. The measure induced by the parametrization of the $d$ dimensional
parameter manifold is denoted $d^d\Theta$. Since we are interested in comparing $\Pr(A|E)$
with $\Pr(B|E)$, the prior $\Pr(E)$ is a common factor that we may omit, and for lack of
any better choice we take the prior probabilities of $A$ and $B$ to be equal and omit them.
Finally, throughout this paper we will assume that the model families of interest to us 
have compact parameter spaces. This condition is easily relaxed by placing regulators 
on non-compact parameter spaces, but we will not concern ourselves with this detail 
here. 



2.1 Natural Priors or Measures on Phase Space 

In order to make further progress we must identify the prior density $w(\Theta)$. In the
absence of a well-motivated prior, a common prescription is to use the uniform distribution
on the parameter space since this is deemed to reflect complete ignorance ([15]).
In fact, this choice suffers from the serious deficiency that the uniform priors relative
to different parametrizations can assign different probability masses to the same subset
of parameters ([12], [?]). Consequently, if $w(\Theta)$ were uniform in the parameters,
the probability of a model family would depend on the arbitrary parametrization.
The problem can be cured by making the much more reasonable requirement that
all distributions rather than all parameters are equally likely.^1 In order to implement
this requirement we should give equal weight to all distinguishable distributions on
a model manifold. However, nearby parameters index very similar distributions. So
let us ask the question, "How do we count the number of distinct distributions in the
neighbourhood of a point on a parameter manifold?" Essentially, this is a question
about the embedding of the parameter manifold in the space of distributions. Points
that are distinguishable as elements of the parameter manifold may be mapped to
indistinguishable points (in some suitable sense) of the embedding space.

To answer the question, let $\Theta_p$ and $\Theta_q$ index two distributions in a parametric
family and let $E = \{e_1, \ldots, e_N\}$ be drawn independently from one of $\Theta_p$ or $\Theta_q$. In the
context of model estimation, a suitable measure of distinguishability can be derived by
asking how well we can guess which of $\Theta_p$ or $\Theta_q$ produced $E$. Let $\alpha_N$ be the probability
that $\Theta_q$ is mistaken for $\Theta_p$ and let $\beta_N$ be the probability that $\Theta_p$ is mistaken for $\Theta_q$.
Let $\beta_N^\epsilon$ be the smallest possible $\beta_N$ given that $\alpha_N < \epsilon$. Then Stein's Lemma tells
us that $\lim_{N\to\infty} (-1/N) \ln \beta_N^\epsilon = D(\Theta_p\|\Theta_q)$ where $D(p\|q) = \int dx \, p(x) \ln(p(x)/q(x))$
is the relative entropy between the densities $p$ and $q$ ([?]).

As shown in the appendix, the proof of Stein's Lemma shows that the minimum
error $\beta_N^\epsilon$ exceeds a fixed $\beta^*$ in the region where $\kappa/N > D(\Theta_p\|\Theta_q)$ with $\kappa = -\ln\beta^* +
\ln(1 - \epsilon)$.^2 By taking $\beta^*$ close to 1 we can identify the region around $\Theta_p$ where the
distributions are not very distinguishable from the one indexed by $\Theta_p$. As $N$ grows
large for fixed $\kappa$, any $\Theta_q$ in this region is necessarily close to $\Theta_p$ since $D(\Theta_p\|\Theta_q)$
attains a minimum of zero when $\Theta_p = \Theta_q$. Therefore, setting $\Delta\Theta = \Theta_q - \Theta_p$,
Taylor expansion gives $D(\Theta_p\|\Theta_q) \approx (1/2) J_{ij}(\Theta_p) \Delta\Theta^i \Delta\Theta^j + O(\Delta\Theta^3)$ where $J_{ij} =
\nabla_{\phi_i} \nabla_{\phi_j} D(\Theta_p\|\Theta_p + \Phi)|_{\Phi=0}$ is the Fisher Information.^3 (We use the convention that
repeated indices are summed over.)
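
The quadratic approximation to the relative entropy can be checked numerically on a one-parameter example. The sketch below is an illustration only: the Bernoulli family and all function names are assumptions introduced here, not part of the original analysis. For this family the Fisher Information is $J(p) = 1/(p(1-p))$, and the ratio of the exact relative entropy to $(1/2) J \Delta^2$ should approach 1 as $\Delta \to 0$.

```python
import math


def bernoulli_relative_entropy(p, q):
    """D(p||q) between Bernoulli distributions with success probabilities p and q."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def bernoulli_fisher_information(p):
    """Fisher Information J(p) = 1 / (p (1 - p)) of the Bernoulli family (illustrative choice)."""
    return 1.0 / (p * (1.0 - p))


def quadratic_approximation(p, delta):
    """Leading Taylor term (1/2) J(p) delta^2 of D(p || p + delta)."""
    return 0.5 * bernoulli_fisher_information(p) * delta ** 2


if __name__ == "__main__":
    p = 0.3
    for delta in (0.1, 0.01, 0.001):
        exact = bernoulli_relative_entropy(p, p + delta)
        approx = quadratic_approximation(p, delta)
        # The ratio tends to 1 as delta shrinks; the deviation is the O(delta^3) remainder.
        print(delta, exact, approx, exact / approx)
```

The deviation of the ratio from 1 shrinks roughly linearly in the displacement, as expected from a cubic remainder.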

In a certain sense, the relative entropy, $D(\Theta_p\|\Theta_q)$, appearing in this problem
is the natural distance between probability distributions in the context of model
selection. Although it does not itself define a metric, the Taylor expansion locally
yields a quadratic form with the Fisher Information acting as the metric. If we
accept $J_{ij}$ as the natural metric, differential geometry immediately tells us that the
reparametrization invariant measure on the parameter manifold is $d^d\Theta \sqrt{\det J}$ ([?], [?]).
Normalizing this measure by dividing by $\int d^d\Theta \sqrt{\det J}$ gives the so-called Jeffreys'
prior on the parameters.
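
The reparametrization invariance of this measure, and the lack of it for a uniform prior, can be illustrated with a small numerical sketch. Everything specific here is an assumption made for illustration: the Bernoulli family with parameter $\theta$, the alternative parametrization $\phi = \theta^2$, and the function names. The code computes the prior mass of the same set of distributions in both parametrizations.

```python
import math


def jeffreys_mass_theta(a, b, steps=20000):
    """Mass of theta in [a, b] under sqrt(J(theta)) dtheta / pi, with J = 1/(theta (1 - theta))."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        theta = a + (i + 0.5) * h  # midpoint rule
        total += h / math.sqrt(theta * (1.0 - theta))
    return total / math.pi  # the integral of sqrt(J) over [0, 1] equals pi


def jeffreys_mass_phi(a, b, steps=20000):
    """Same set of distributions, integrated in phi = theta**2 with J(phi) = J(theta) (dtheta/dphi)**2."""
    lo, hi = a ** 2, b ** 2
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        phi = lo + (i + 0.5) * h
        theta = math.sqrt(phi)
        dtheta_dphi = 1.0 / (2.0 * theta)
        fisher_phi = dtheta_dphi ** 2 / (theta * (1.0 - theta))
        total += h * math.sqrt(fisher_phi)
    return total / math.pi


def uniform_mass_theta(a, b):
    """Mass of theta in [a, b] under a prior uniform in theta on [0, 1]."""
    return b - a


def uniform_mass_phi(a, b):
    """Mass of the same set under a prior uniform in phi = theta**2 on [0, 1]."""
    return b ** 2 - a ** 2


if __name__ == "__main__":
    print(jeffreys_mass_theta(0.2, 0.5), jeffreys_mass_phi(0.2, 0.5))  # agree
    print(uniform_mass_theta(0.2, 0.5), uniform_mass_phi(0.2, 0.5))    # disagree
```

The two Jeffreys masses coincide because the factor $\sqrt{\det J}$ transforms with exactly the Jacobian needed to compensate the change of variables, whereas the two "uniform" priors assign different masses to the same distributions.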

^1 This applies the principle of maximum entropy on the invariant space of distributions rather
than the arbitrary space of parameters.

^2 This assertion is not strictly true. See the appendix for more details.

^3 We have assumed that the derivatives with respect to $\Theta$ commute with expectations taken in
the distribution $\Theta_p$ to identify the Fisher Information with the matrix of second derivatives of the
relative entropy.

A more satisfying explanation of the choice of prior proceeds by directly counting
the number of distinguishable distributions in the neighbourhood of a point on a
parameter manifold. Define the volume of indistinguishability at levels $\epsilon$, $\beta^*$, and
$N$ to be the volume of the region around $\Theta_p$ where $\kappa/N > D(\Theta_p\|\Theta_q)$ so that the
probability of error in distinguishing $\Theta_q$ from $\Theta_p$ is high. We find to leading order:

$$V_{\epsilon,\beta^*,N} = \left(\frac{2\pi\kappa}{N}\right)^{d/2} \frac{1}{\Gamma(d/2+1)} \frac{1}{\sqrt{\det J_{ij}(\Theta_p)}} \tag{2}$$

If $\beta^*$ is very close to one, the distributions inside $V_{\epsilon,\beta^*,N}$ are not very distinguishable
and the Bayesian prior should not treat them as separate distributions. We wish to
construct a measure on the parameter manifold that reflects this indistinguishability.
We also assume a principle of "translation invariance" by supposing that volumes
of indistinguishability at given values of $N$, $\beta^*$ and $\epsilon$ should have the same measure
regardless of where in the space of distributions they are centered. An integration
measure reflecting these principles of indistinguishability and translation invariance
can be defined at each level $\epsilon$, $\beta^*$ and $N$ by covering the parameter manifold economically
with volumes of indistinguishability and placing a delta function in the center
of each element of the cover. This definition reflects indistinguishability by ignoring
variations on a scale smaller than the covering volumes and reflects translation invariance
by giving each covering volume equal weight in integrals over the parameter
manifold. The measure can be normalized by an integral over the entire parameter
manifold to give a prior distribution. The continuum limit of this discretized measure
is obtained by taking the limits $\beta^* \to 1$, $\epsilon \to 0$ and $N \to \infty$. In this limit the
measure counts distributions that are completely indistinguishable ($\beta^* = 1$) even in
the presence of an infinite amount of data ($N = \infty$).^4

To see the effect of the above procedure, imagine a parameter manifold which can
be partitioned into $k$ regions in each of which the Fisher Information is constant. Let
$J_i$, $U_i$ and $V_i$ be the Fisher Information, parametric volume and volume of
indistinguishability in the $i$th region. Then the prior assigned to the $i$th volume by the above
procedure will be $P_i = (U_i/V_i) / \sum_{j=1}^{k} (U_j/V_j) = U_i \sqrt{\det J_i} / \sum_{j=1}^{k} U_j \sqrt{\det J_j}$. Since all
the $\beta^*$, $\epsilon$ and $N$ dependences cancel we are now free to take the continuum limit of
$P_i$. This suggests that the prior density induced by the prescription described in the
previous paragraph is:

$$w(\Theta) = \frac{\sqrt{\det J(\Theta)}}{\int d^d\Theta \sqrt{\det J(\Theta)}} \tag{3}$$

By paying careful attention to technical difficulties involving sets of measure zero
and certain sphere packing problems, it can be rigorously shown that the normalized
continuum measure on a parameter manifold that reflects indistinguishability
and translation invariance is $w(\Theta)$ or Jeffreys' prior ([?]). In essence, the heuristic
argument above and the derivation in [?] show how to "divide out" the volume of
indistinguishable distributions on a parameter manifold and hence give equal weight
to equally distinguishable volumes of distributions. In this sense, Jeffreys' prior is
seen to be a uniform prior on the distributions indexed by a parametric family.
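
The cancellation of the $\beta^*$, $\epsilon$ and $N$ dependences in the piecewise-constant argument can be made concrete with a short sketch. The region volumes, Fisher Information determinants and function names below are hypothetical values chosen for illustration. The volume of indistinguishability is taken to be the leading-order volume of the ellipsoid $(1/2) J_{ij} \Delta\Theta^i \Delta\Theta^j \le \kappa/N$, which is an assumption of this sketch.

```python
import math


def indistinguishability_volume(det_j, kappa, n, d):
    """Leading-order volume of the ellipsoid (1/2) J_ij dT^i dT^j <= kappa / n in d dimensions."""
    ball = (2.0 * math.pi * kappa / n) ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    return ball / math.sqrt(det_j)


def region_priors_by_counting(volumes, det_js, kappa, n, d):
    """Prior of each region as its share of the indistinguishability volumes needed to cover it."""
    counts = [u / indistinguishability_volume(j, kappa, n, d)
              for u, j in zip(volumes, det_js)]
    total = sum(counts)
    return [c / total for c in counts]


def region_priors_jeffreys(volumes, det_js):
    """Prior of each region from U_i sqrt(det J_i), normalized over all regions."""
    weights = [u * math.sqrt(j) for u, j in zip(volumes, det_js)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    volumes = [1.0, 2.0, 0.5]   # hypothetical parametric volumes U_i
    det_js = [4.0, 1.0, 9.0]    # hypothetical det J_i, constant within each region
    print(region_priors_by_counting(volumes, det_js, kappa=0.1, n=100, d=2))
    print(region_priors_by_counting(volumes, det_js, kappa=2.0, n=10 ** 6, d=2))
    print(region_priors_jeffreys(volumes, det_js))
```

All three lists agree, since the common prefactor of the indistinguishability volume drops out of the normalized ratio and only $\sqrt{\det J_i}$ survives.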

^4 The $\alpha$ and $\beta$ errors can be treated more symmetrically using the Chernoff bound instead of
Stein's lemma, but we will not do that here.

2.2 Connection With Statistical Mechanics



Putting everything together we get the following expression for the Bayesian posterior
probability of a parametric family in the absence of any prior knowledge about the
relative likelihood of the distributions indexed by the family:

$$\Pr(A|E) = \frac{\int d^d\Theta \sqrt{\det J} \, \exp\left[-N\left(\frac{-\ln\Pr(E|\Theta)}{N}\right)\right]}{\int d^d\Theta \sqrt{\det J}} \tag{4}$$

This equation resembles a partition function with a temperature $1/N$ and an energy
function $(-1/N) \ln\Pr(E|\Theta)$. The dependence on the data $E$ is similar to the dependence
of a disordered partition function on the specific set of defects introduced into
the system.

The analogy can be made stronger since the strong law of large numbers says that
$(-1/N) \ln\Pr(E|\Theta) = (-1/N) \sum_{i=1}^{N} \ln\Pr(e_i|\Theta)$ converges in the almost sure sense to:

$$E_t\left[-\ln\Pr(e_i|\Theta)\right] = \int dx \, t(x) \ln\left(\frac{t(x)}{\Pr(x|\Theta)}\right) - \int dx \, t(x) \ln(t(x)) = D(t\|\Theta) + h(t) \tag{5}$$

Here $D(t\|\Theta)$ is the relative entropy between the true distribution and the distribution
indexed by $\Theta$, and $h(t)$ is the differential entropy of the true distribution that is
presumed to be finite. With this large $N$ limit in mind we rewrite the posterior
probability in Equation 4 as the following partition function:

$$\Pr(A|E) = \frac{\int d^d\Theta \sqrt{\det J} \, e^{-N(H_0 + H_d)}}{\int d^d\Theta \sqrt{\det J}} \tag{6}$$

where $H_0(\Theta) = D(t\|\Theta)$ and $H_d(E, \Theta) = (-1/N) \ln\Pr(E|\Theta) - D(t\|\Theta) - h(t)$. (Equation 6
differs from Equation 4 by an irrelevant factor of $\exp[-N h(t)]$.) $H_0$ can be
regarded as the "energy" of the "state" $\Theta$ while $H_d$ is the additional contribution that
arises via interaction with the "defects" represented by the data. It is instructive to
examine the quenched approximation to this disordered partition function. (See [14]
for a discussion of quenching in statistical mechanical systems.) Quenching is carried
out by taking the expectation value of the energy of a state in the distribution generating
the defects. In the above system $E_t[H_d] = 0$, giving the quenched posterior
probability:

$$\Pr(A|E)_Q = \frac{\int d^d\Theta \sqrt{\det J} \, e^{-N D(t\|\Theta)}}{\int d^d\Theta \sqrt{\det J}} \tag{7}$$

In Section 4 we will see that the logarithm of the posterior probability converges to
the logarithm of the quenched probability in a certain sense. This will lead us to
regard the quenched probability as a sort of theoretical index of the complexity of a
parametric family relative to a given true distribution.
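
The law-of-large-numbers statement behind the energy function, and hence the vanishing of the disorder term on average, can be illustrated by simulation. The sketch below makes several assumptions for the sake of a runnable example: outcomes are binary, the true distribution is Bernoulli with parameter `t`, the model is Bernoulli with parameter `theta`, and $h(t)$ is therefore the discrete Shannon entropy rather than a differential entropy. All names are hypothetical.

```python
import math
import random


def sample_bernoulli(t, n, seed=0):
    """Draw n independent binary outcomes from the 'true' distribution Bernoulli(t)."""
    rng = random.Random(seed)
    return [1 if rng.random() < t else 0 for _ in range(n)]


def energy_per_sample(data, theta):
    """(-1/N) ln Pr(E|theta) for binary outcomes under a Bernoulli(theta) model."""
    n = len(data)
    k = sum(data)
    return -(k * math.log(theta) + (n - k) * math.log(1.0 - theta)) / n


def relative_entropy(t, theta):
    """D(t||theta) between Bernoulli(t) and Bernoulli(theta)."""
    return t * math.log(t / theta) + (1.0 - t) * math.log((1.0 - t) / (1.0 - theta))


def entropy(t):
    """Shannon entropy h(t) of Bernoulli(t), in nats (discrete stand-in for differential entropy)."""
    return -(t * math.log(t) + (1.0 - t) * math.log(1.0 - t))


if __name__ == "__main__":
    t, theta = 0.3, 0.6
    limit = relative_entropy(t, theta) + entropy(t)
    for n in (100, 1000, 20000):
        data = sample_bernoulli(t, n, seed=0)
        # The difference is the disorder term H_d, which fluctuates around zero.
        print(n, energy_per_sample(data, theta) - limit)
```

The printed difference plays the role of $H_d$ for one realization of the data; it shrinks as $N$ grows, which is why replacing it by its expectation gives a useful reference quantity.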

3 Asymptotic Analysis or Low-Temperature Expansion

Equation 6 in the previous section represents the full content of Bayesian model
selection. However, in order to extract some insight it is necessary to examine special
cases. Let $-\ln\Pr(E|\Theta)$ be a smooth function of $\Theta$ that attains a global minimum at
$\hat\Theta$ and assume that $J_{ij}(\Theta)$ is a smooth function of $\Theta$ that is positive definite at $\hat\Theta$.
Finally, suppose that $\hat\Theta$ lies in the interior of the compact parameter space and that
the values of local minima are bounded away from the global minimum by some $b$.^5
For any given $b$, for sufficiently large $N$, the Bayesian posterior probability will then
be dominated by the neighbourhood of $\hat\Theta$ and we can carry out a low temperature
expansion around the saddlepoint at $\hat\Theta$.

We take the metric on the parameter manifold to be the Fisher Information since
the Jeffreys' prior has the form of a measure derived from such a metric. This choice
of metric also follows the work described in [?]. We will use $\nabla_\mu$ to indicate the
covariant derivative with respect to $\Theta^\mu$, with a flat connection for the Fisher Information
metric.^6 Readers who are unfamiliar with covariant derivatives may read $\nabla_\mu$
as the partial derivative with respect to $\Theta^\mu$ since we will not be emphasizing the
geometric content of the covariant derivative.

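Let $I_{\mu_1\cdots\mu_i} = (-1/N) \nabla_{\mu_1} \cdots \nabla_{\mu_i} \ln\Pr(E|\Theta)|_{\hat\Theta}$ and $F_{\mu_1\cdots\mu_i} = \nabla_{\mu_1} \cdots \nabla_{\mu_i} \mathrm{Tr} \ln J_{ij}|_{\hat\Theta}$
where Tr represents the Trace of a matrix. Writing $(\det J)^{1/2}$ as $\exp[(1/2) \mathrm{Tr} \ln J]$,
we Taylor expand the exponent in the integrand of the Bayesian posterior around $\hat\Theta$,
and rescale the integration variable to $\Phi = N^{1/2}(\Theta - \hat\Theta)$ to arrive at:

$$\Pr(A|E) = \frac{e^{-\left[-\ln\Pr(E|\hat\Theta) - \frac{1}{2}\mathrm{Tr}\ln J(\hat\Theta)\right]} \, N^{-d/2} \int d^d\Phi \; e^{-\left((1/2) I_{\mu\nu}\phi^\mu\phi^\nu + G(\Phi)\right)}}{\int d^d\Theta \sqrt{\det J_{ij}}} \tag{8}$$

Here $G(\Phi)$ collects the terms that are suppressed by powers of $N$:

$$\begin{aligned}
G(\Phi) &= \sum_{i=1}^{\infty} \frac{1}{\sqrt{N^i}} \left[ \frac{1}{(i+2)!} I_{\mu_1\cdots\mu_{(i+2)}} \phi^{\mu_1} \cdots \phi^{\mu_{(i+2)}} - \frac{1}{2\,i!} F_{\mu_1\cdots\mu_i} \phi^{\mu_1} \cdots \phi^{\mu_i} \right] \\
&= \frac{1}{\sqrt{N}} \left[ \frac{1}{3!} I_{\mu_1\mu_2\mu_3} \phi^{\mu_1}\phi^{\mu_2}\phi^{\mu_3} - \frac{1}{2} F_{\mu_1} \phi^{\mu_1} \right]
 + \frac{1}{N} \left[ \frac{1}{4!} I_{\mu_1\cdots\mu_4} \phi^{\mu_1} \cdots \phi^{\mu_4} - \frac{1}{2\,2!} F_{\mu_1\mu_2} \phi^{\mu_1}\phi^{\mu_2} \right]
 + O\!\left(\frac{1}{N^{3/2}}\right)
\end{aligned} \tag{9}$$
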

^5 In Section 3.2 we will discuss how to relax these conditions.

^6 See [?], [?] for discussions of differential geometry in a statistical setting.

As before, repeated indices are summed over. The integral in Equation 8 may now
be evaluated in a series expansion using a standard trick from statistical mechanics
([?]). Define a "source" $h = \{h_1, \ldots, h_d\}$ as an auxiliary variable. Then it is easy to
verify that:

$$\int d^d\Phi \; e^{-\left((1/2) I_{\mu\nu}\phi^\mu\phi^\nu + G(\Phi)\right)} = \left. e^{-G(\nabla_h)} \int d^d\Phi \; e^{-(1/2) I_{\mu\nu}\phi^\mu\phi^\nu + h_\mu\phi^\mu} \right|_{h=0} \tag{10}$$

where the argument of $G$, $\Phi = (\phi^1, \ldots, \phi^d)$, has been replaced by $\nabla_h = \{\partial_{h_1}, \ldots, \partial_{h_d}\}$ and
we assume that the derivatives commute with the integral. The remaining obstruction
is the compactness of the parameter space. We make the final assumption that the
bounds of the integration can be extended to infinity with negligible error since $\hat\Theta$ is
sufficiently in the interior or because $N$ is sufficiently large.
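
For reference, the source trick rests on the standard Gaussian integral with a linear term, valid for a positive definite symmetric matrix $I_{\mu\nu}$ once the integration bounds are extended to infinity. The sign convention for the source term ($+h_\mu\phi^\mu$ in the exponent) is an assumption of this note; with it, each derivative with respect to $h_\mu$ brings down one factor of $\phi^\mu$.

```latex
\int d^d\Phi \, \exp\!\Big[-\tfrac{1}{2} I_{\mu\nu}\phi^\mu\phi^\nu + h_\mu\phi^\mu\Big]
  = \frac{(2\pi)^{d/2}}{\sqrt{\det I}} \,
    \exp\!\Big[\tfrac{1}{2} h_\mu (I^{-1})^{\mu\nu} h_\nu\Big],
\qquad
\left.\frac{\partial}{\partial h_\mu}\frac{\partial}{\partial h_\nu}
  \exp\!\Big[\tfrac{1}{2} h_\alpha (I^{-1})^{\alpha\beta} h_\beta\Big]\right|_{h=0}
  = (I^{-1})^{\mu\nu}.
```

Odd numbers of derivatives vanish at $h = 0$, so the terms of $G$ with an odd power of $\Phi$ contribute only in pairs. This is why the resulting series proceeds in powers of $1/N$ rather than $1/\sqrt{N}$.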

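Performing the Gaussian integral in Equation 10 and applying the differential
operator $\exp[-G(\nabla_h)]$ we find an asymptotic series in powers of $1/N$. It turns out to
be most useful to examine $\chi_E(A) = -\ln\Pr(A|E)$. Defining $V = \int d^d\Theta \sqrt{\det J(\Theta)}$ we
find to $O(1/N)$:

$$\begin{aligned}
\chi_E(A) = {} & N\left(\frac{-\ln\Pr(E|\hat\Theta)}{N}\right) + \frac{d}{2}\ln N
 - \frac{1}{2}\ln\left(\frac{\det J_{ij}(\hat\Theta)}{\det I_{\mu\nu}(\hat\Theta)}\right)
 - \ln\left(\frac{(2\pi)^{d/2}}{V}\right) \\
& + \frac{1}{N}\Bigg\{ \frac{I_{\mu_1\mu_2\mu_3\mu_4}}{4!}\left[(I^{-1})^{\mu_1\mu_2}(I^{-1})^{\mu_3\mu_4} + \ldots\right]
 - \frac{F_{\mu_1\mu_2}}{2\,2!}\left[(I^{-1})^{\mu_1\mu_2} + (I^{-1})^{\mu_2\mu_1}\right] \\
& \quad - \frac{I_{\mu_1\mu_2\mu_3} I_{\nu_1\nu_2\nu_3}}{2!\,3!\,3!}\left[(I^{-1})^{\mu_1\mu_2}(I^{-1})^{\mu_3\nu_1}(I^{-1})^{\nu_2\nu_3} + \ldots\right]
 - \frac{F_{\mu_1}F_{\mu_2}}{2!\,4\,2!\,2!}\left[(I^{-1})^{\mu_1\mu_2} + \ldots\right] \\
& \quad + \frac{F_{\mu_1} I_{\mu_2\mu_3\mu_4}}{2!\,2\,2!\,3!}\left[(I^{-1})^{\mu_1\mu_2}(I^{-1})^{\mu_3\mu_4} + \ldots\right] \Bigg\}
\end{aligned} \tag{11}$$

Terms of higher orders in $1/N$ are easily evaluated with a little labour, and systematic
diagrammatic expansions can be developed ([?]). In the next section we will discuss
the meaning of Equation 11.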



3.1 Meaning of the Asymptotic Expansion 

We can see why the Bayesian posterior measures simplicity and accuracy of a parametric
family by examining Equation 11 and noting that models with larger $\Pr(A|E)$
and hence smaller $\chi_E(A)$ are better. The $O(N)$ term, $N(-\ln\Pr(E|\hat\Theta)/N)$, which
dominates asymptotically, is the negative log likelihood of the data evaluated at the maximum
likelihood point.^7 It measures the accuracy with which the parametric family can
describe the available data. We will see in Section 4 that for sufficiently large $N$
model families with the smallest relative entropy distance to the true distribution are
favoured by this term. The term of $O(N)$ arises from the saddlepoint value of the
integrand in Equation 6 and represents the Landau approximation to the partition
function.

The term of $O(\ln N)$ penalizes models with many degrees of freedom and is a
measure of simplicity. This term arises "physically" from the statistical fluctuations
around the saddlepoint configuration. These fluctuations cause the partition function
in Equation 6 to scale as $N^{-d/2}$, leading to the logarithmic term in $\chi_E$. Note that the
terms of $O(N)$ and $O(\ln N)$ have appeared together in the literature as the stochastic
complexity of a parametric family relative to a collection of data ([?], [?]). This
definition is justified by arguing that a family with the lowest stochastic complexity
provides the shortest codes for the data in the limit that $N \to \infty$. Our results suggest
that stochastic complexity is merely a truncation of the logarithm of the posterior
probability of a model family and that adding the subleading terms in $\chi_E$ to the
definition of stochastic complexity would yield shorter codes for finite $N$.

^7 This term is $O(N)$, not $O(1)$, because $(1/N) \ln\Pr(E|\hat\Theta)$ approaches a finite limit at large $N$.

The $O(1)$ term, which arises from the determinant of quadratic fluctuations around
the saddlepoint, is even more interesting. The square root of the determinant of $I^{-1}$ is
proportional to the volume of the ellipsoid in parameter space around $\hat\Theta$ where the value of the
integrand of the Bayesian posterior is significant.^8 The scale for determining whether
$\det I^{-1}$ is large or small is set by the Fisher Information on the surface whose
determinant defines the volume element. Consequently the term $\ln(\det J/\det I)^{1/2}$ can
be understood as measuring the robustness of the model in the sense that it measures
the relative volume of the parameter space which provides good models of the
data. More robust models in this sense will be less sensitive to the precise choice
of parameters. We also observe from the discussion regarding Jeffreys' prior that
the volume of indistinguishability around $\Theta^*$ is proportional to $(\det J)^{-1/2}$. So the
quantity $(\det J/\det I)^{1/2}$ is essentially proportional to the ratio $V_{\mathrm{large}}/V_{\mathrm{indist}}$, the
ratio of the volume where the integrand of the Bayesian posterior is large to the volume
of indistinguishability introduced earlier. Essentially, a model family is better
(more natural or robust) if it contains many distinguishable distributions that are
close to the true distribution. Related observations have been made before in [13], [?] and in [?]
but without the interpretation in terms of the robustness of a model family.

The term $\ln((2\pi)^{d/2}/V)$ can be understood as a preference for models that have a
smaller invariant volume in the space of distributions and hence are more constrained.
The terms proportional to $1/N$ are less easy to interpret. They involve higher derivatives
of the metric on the parameter manifold and of the relative entropy distances
between points on the manifold and the true distribution. This suggests that these
terms essentially penalize high curvatures of the model manifold, but it is hard to
extract such an interpretation in terms of components of the curvature tensor on the
manifold. It is worth noting that while terms of $O(1)$ and larger in $\chi_E(A)$ depend
at most on the measure (prior distribution) assigned to the parameter manifold, the
terms of $O(1/N)$ depend on the geometry via the connection coefficients in the covariant
derivatives. For this reason, the $O(1/N)$ terms are the leading probes of the
effect that the geometry of the space of distributions has on statistical inference in a
Bayesian setting and so it would be very interesting to analyze them.

Bayesian model family inference embodies Occam's Razor because, for small $N$,
the subleading terms that measure simplicity and robustness will be important, while
for large $N$, the accuracy of the model family dominates.
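
The accuracy, dimensionality, robustness and volume terms can be compared against an exact posterior in a case where the integral is available in closed form. The sketch below is an illustration under stated assumptions, not a computation from the paper: the family is Bernoulli ($d = 1$) with Jeffreys' prior, the data are a binary sequence with `k` successes in `n` trials, and the function names are hypothetical. For this family the posterior integral is a Beta function, the invariant volume is $V = \pi$, and the matrix $I$ at the maximum likelihood point equals the Fisher Information $J$, so the robustness term vanishes.

```python
import math


def chi_exact(n, k):
    """-ln Pr(A|E) for the Bernoulli family with Jeffreys' prior, via the Beta function."""
    # Pr(A|E) = (1/pi) * integral of theta^(k - 1/2) (1 - theta)^(n - k - 1/2) = B(k + 1/2, n - k + 1/2) / pi
    log_beta = math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5) - math.lgamma(n + 1.0)
    return -(log_beta - math.log(math.pi))


def chi_leading_terms(n, k):
    """Sum of the O(N), O(ln N) and O(1) terms of the expansion for the same family."""
    theta_hat = k / n  # maximum likelihood point
    accuracy = -(k * math.log(theta_hat) + (n - k) * math.log(1.0 - theta_hat))
    dimension = 0.5 * math.log(n)  # (d/2) ln N with d = 1
    robustness = 0.0               # -(1/2) ln(det J / det I); I = J at theta_hat for this family
    volume = -math.log(math.sqrt(2.0 * math.pi) / math.pi)  # -ln((2 pi)^(d/2) / V) with V = pi
    return accuracy + dimension + robustness + volume


if __name__ == "__main__":
    for n, k in ((10, 3), (100, 30), (1000, 300)):
        exact = chi_exact(n, k)
        approx = chi_leading_terms(n, k)
        # The residual is the contribution of the O(1/N) and higher terms.
        print(n, exact, approx, exact - approx)
```

The residual falls off roughly as $1/N$, consistent with the order of the first neglected terms, and is already small at moderate sample sizes for this simple family.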

3.2 Analysis of More General Situations 

The asymptotic expansion in Equation 11 and the subsequent analysis were carried
out for the special case of a posterior probability with a single global maximum in
the integrand that lay in the interior of the parameter space. Nevertheless, the basic
insights are applicable far more generally. First of all, if the global maximum lies on
the boundary of the parameter space, we can account for the portion of the peak that
is cut off by the boundary and reach essentially the same conclusions. Secondly, if
there are multiple discrete global maxima, each contributes separately to the asymptotic
expansion and the contributions can be added to reach the same conclusions.
The most important difficulty arises when the global maximum is degenerate so that
the matrix $I$ in Equation 11 has zero eigenvalues. The eigenvectors corresponding to
these zeroes are tangent to directions in parameter space in which the value of the
maximum is unchanged up to second order in perturbations around the maximum.
These sorts of degeneracies are particularly likely to arise when the true distribution
is not a member of the family under consideration, and can be dealt with by
the method of collective coordinates. Essentially, we would choose new parameters
for the model, a subset of which parametrize the degenerate subspace. The integral
over the degenerate subspace then factors out of the integral in Equation 10 and
essentially contributes a factor of the volume of the degenerate subspace times terms
arising from the action of the differential operator $\exp[-G(\nabla_h)]$. The evaluation of
specific examples of this method in the context of statistical inference will be left to
future publications.

^8 If we fix a fraction $f < 1$ where $f$ is close to 1, the integrand of the Bayesian posterior will be
greater than $f$ times the peak value in an elliptical region around the maximum.

There are situations in which the perturbative expansion in powers of $1/N$ is
invalid. For example, the partition function in Equation 6 regarded as a function of
$N$ may have singularities. These singularities and the associated breakdown of the
perturbative analysis of this section would be of the utmost interest since they would
be signatures of "phase transitions" in the process of statistical inference. This point
will be discussed further in Section 5.



4 The Razor of A Model Family 

The large $N$ limit of the partition function in Equation 6 suggests the definition of
an ideal theoretical index of the complexity of a parametric family relative to a given
true distribution.

We know from Equation 5 that $(-1/N) \ln[\Pr(E|\Theta)] \to D(t\|\Theta) + h(t)$ as $N$ grows
large. Now assume that the maximum likelihood estimator is consistent in the sense
that $\hat\Theta = \arg\max_\Theta \ln\Pr(E|\Theta)$ converges in probability to $\Theta^* = \arg\min_\Theta D(t\|\Theta)$ as
$N$ grows large.^9 Also suppose that the log likelihood of a single outcome $\ln\Pr(e_i|\Theta)$
considered as a family of functions of $\Theta$ indexed by $e_i$ is equicontinuous at $\Theta^*$.^10
Finally, suppose that all derivatives of $\ln\Pr(e_i|\Theta)$ with respect to $\Theta$ are also
equicontinuous at $\Theta^*$.

^9 In other words, assume that given any neighbourhood of $\Theta^*$, $\hat\Theta$ falls in that neighbourhood
with high probability for sufficiently large $N$. If the maximum likelihood estimator is not consistent,
statistics has very little to say about the inference of probability densities.

^10 In other words, given any $\epsilon > 0$, there is a neighbourhood $M$ of $\Theta^*$ such that for every $e_i$ and
$\Theta \in M$, $|\ln\Pr(e_i|\Theta) - \ln\Pr(e_i|\Theta^*)| < \epsilon$.

Subject to the assumptions in the previous paragraph it is easily shown that
$(-1/N) \ln\Pr(E|\hat\Theta) \to D(t\|\Theta^*) + h(t)$ as $N$ grows large. Next, using the covariant
derivative with respect to $\Theta$ defined in Section 3, let $J_{\mu_1\cdots\mu_i} = \nabla_{\mu_1} \cdots \nabla_{\mu_i} D(t\|\Theta)|_{\Theta^*}$.
It also follows that $I_{\mu_1\cdots\mu_i} \to J_{\mu_1\cdots\mu_i}$ ([?]). Since the terms in the asymptotic expansion
of $(1/N)(\chi_E - N h(t))$ (Equation 11) are continuous functions of $\ln\Pr(E|\Theta)$ and its
derivatives, they individually converge to limits obtained by replacing each $I$ by
$J$ and $(-1/N) \ln\Pr(E|\hat\Theta)$ by $D(t\|\Theta^*) + h(t)$. Define $(-1/N) \ln R_N(A)$ to be the
sum of the series of limits of the individual terms in the asymptotic expansion of
$(1/N)(\chi_E - N h(t))$:



-hiRN{A) 



Dim*] 



d 



In 



det Jij(e) 



N ' 2N 2N [detJ^,(0)J 

This formal series of limits can be resummed to obtain: 



(27r 



\d/2 



V 



Rn{A) 



jd^eVl 



(12) 



(13) 



We have encountered RNiA) before in Section ^]2|as the quenched approximation to 
the partition function in Equation ^. Rn{A) will be called the razor of the model 
family A. 

The razor, R]S[{A), is a theoretical index of the complexity of the model family 
A relative to the true distribution t given data points. In a certain sense, the 
razor is the ideal quantity that Bayesian methods seek to estimate from the data 
available in a given realization of the model inference problem. Indeed, the quenched 
approximation to the Bayesian partition function consists precisely of averaging over 
the data in different realizations. The terms in the expansion of the log razor in 



Equation [T^ are the ideal analogues of the terms in xe since they arise from derivatives 
of the relative entropy distance between distributions indexed by the model family 
and the true distribution. The leading term tells us that for sufficiently large N, 
Bayesian inference picks the model family that comes closest to the true distribution 
in relative entropy. The subleading terms have the same interpretations as the terms 
in Xe discussed in the previous section, except that they are the ideal quantities to 
which the corresponding terms in xe tend when enough data is available. 

The razor is useful when we know the true distribution as well as the model families
being used by a particular system and we wish to analyze the expected behaviour of
Bayesian inference. It is also potentially useful as a tool for modelling and analysis
of the general types of phenomena that can occur in Bayesian inference: different
relative entropy distances $D(t\|\Theta)$ can yield radically different learning behaviours, as
discussed in the next section. The razor is considerably easier to analyze than the full
Bayesian posterior probability since the quenched approximation to the Bayesian
partition function defines a statistical mechanics on the space of distributions in which
the "disorder" has been averaged out. The tools of statistical mechanics can then
be straightforwardly applied to a system with temperature $1/N$ and energy function
$D(t\|\Theta)$.
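
As a concrete illustration, the razor can be evaluated numerically for a one-parameter family. The sketch below is not taken from the paper: it assumes the unit-variance Gaussian location family with mean restricted to $[-L, L]$ (for which the Fisher information is 1, so Jeffreys' prior is uniform and $V = 2L$) and a hypothetical true distribution $t = N(m, s^2)$. Because $D(t\|\Theta)$ is exactly quadratic in the mean for this choice, the saddlepoint estimate $-N D(t\|\Theta^*) - \frac{1}{2}\ln(N/2\pi) - \ln V$ should agree with the direct integral up to negligible boundary effects.

```python
import math


def kl_to_unit_gaussian(m, s2, mu):
    """D(t || theta) for t = N(m, s2) and the model distribution N(mu, 1)."""
    return 0.5 * (s2 + (m - mu) ** 2 - 1.0 - math.log(s2))


def log_razor(n, m=0.3, s2=1.5, half_width=5.0, grid=20001):
    """ln R_N by midpoint integration over mu in [-L, L] (uniform Jeffreys prior)."""
    step = 2.0 * half_width / grid
    total = 0.0
    for k in range(grid):
        mu = -half_width + (k + 0.5) * step
        total += math.exp(-n * kl_to_unit_gaussian(m, s2, mu)) * step
    # Divide by the volume V = 2L of the parameter interval.
    return math.log(total / (2.0 * half_width))


def log_razor_saddlepoint(n, m=0.3, s2=1.5, half_width=5.0):
    """Saddlepoint estimate: -N D* - (1/2) ln(N / 2 pi) - ln V, with d = 1."""
    d_star = kl_to_unit_gaussian(m, s2, m)  # closest point of the family is mu = m
    return -n * d_star - 0.5 * math.log(n / (2.0 * math.pi)) - math.log(2.0 * half_width)


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, log_razor(n), log_razor_saddlepoint(n))
```

The values of `m`, `s2` and `half_width` are arbitrary illustrative choices; for families in which the relative entropy is not quadratic, the two functions would differ by the $O(1/N)$ corrections.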
5 Biophysical Relevance and Some Open Questions



The general framework described in this paper is relevant to biophysics if we believe 
that neural systems optimize their accumulation of information from a statistically 
varying environment. This is likely to be true in at least some circumstances since 
an organism derives clear advantages from rapid and efficient detection and encoding 
of information. For example, see the discussions by Bialek and Atick of neural signal
processing systems that approach physical and information-theoretic limits.
A creature such as a fly is faced with the problem of estimating the statistical pro- 
file of its environment from the small amount of data available at its retina. The 
general formalism presented in this paper applies to such problems and an optimally 
designed fly would implement the formalism subject to the constraints of its biological 
hardware. In this section we will discuss several interesting questions in the theory 
of learning that can be discussed effectively in the statistical mechanical language 
introduced here. 

First of all, consider the possibility of "phase transitions" in the disordered
partition function that describes the Bayesian posterior probability or in the quenched
approximation defining the razor. Phase transitions arise from a competition between
entropy and energy which, in the present context, is a competition between simplicity
and accuracy. We should expect the existence of systems in which inference at small
$N$ is dominated by "simpler" and more "robust" saddlepoints whereas at large $N$
more "accurate" saddlepoints are favoured. As discussed in Section 3.1, the
distributions in the neighbourhood of "simpler" and more "robust" saddlepoints are more
concentrated near the true distribution.* The transitions between regimes dominated by these
different saddlepoints would manifest themselves as singularities in the perturbative
methods that led to the asymptotic expansions for $\chi_E$ and $\ln R_N(A)$.
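
The competition described above can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not a model from the paper: each saddlepoint is assumed to contribute a leading free energy $N D - \ln r$, where $D$ is its relative entropy distance from the true distribution and $\ln r$ is a hypothetical $O(1)$ robustness term (the $\frac{d}{2}\ln N$ term is common to all saddlepoints of a family and cancels). All numerical values are invented for illustration.

```python
import math


def saddle_free_energy(n, divergence, log_robustness):
    """Leading free energy of one saddlepoint: N * D minus an O(1) robustness term."""
    return n * divergence - log_robustness


def dominant_saddle(n, saddles):
    """Name of the saddlepoint with the lowest free energy at sample size n."""
    return min(saddles, key=lambda name: saddle_free_energy(n, *saddles[name]))


def crossover_n(simple, accurate):
    """Sample size at which the two saddlepoint contributions are equal."""
    d_simple, logr_simple = simple
    d_accurate, logr_accurate = accurate
    return (logr_simple - logr_accurate) / (d_simple - d_accurate)


# Hypothetical saddlepoints: (relative entropy D, ln of robustness factor).
SADDLES = {
    "simple": (0.05, 0.0),              # less accurate, more robust
    "accurate": (0.01, math.log(0.01)),  # more accurate, less robust
}

if __name__ == "__main__":
    print("crossover near N =", crossover_n(SADDLES["simple"], SADDLES["accurate"]))
    for n in (50, 200):
        print(n, dominant_saddle(n, SADDLES))
```

With these numbers the simple saddlepoint dominates below roughly $N \approx 115$ and the accurate one above it; the non-analytic change of the dominant contribution at the crossover is the toy version of the transition discussed in the text.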

The phase transitions discussed in the previous paragraph are interesting even
when the task at hand is not the comparison of model families, but merely the
selection of parameters for a given family. We have interpreted the terms of
$O(1)$ in $\chi_E$ as measurements of the "robustness" or "naturalness" of a model. These
robustness terms can be evaluated at different saddlepoints of a given model and a
more robust point may be preferable at small $N$ since the parameter estimation would
then be less sensitive to fluctuations in the data.

So far we have concentrated on the behaviour of the Bayesian posterior and the
razor as functions of the number of data points. Instead, we could ask how they
behave when the true distribution is changed. For example, this can happen in a
biophysical context if the environment sensed by a fly changes when it suddenly
finds itself indoors. In statistical mechanical terms, we wish to know what happens
when the energy of a system is time-dependent. If the change is abrupt, the system
will dynamically move between equilibria defined by the energy functions before and
after the change. If the change is very slow we would expect adaptation that proceeds
gradually. In the language of statistical inference, these adaptive processes correspond
to learning of changes in the true distribution.

* In Section 3.1 we have discussed how Bayesian inference embodies Occam's razor by penalizing
complex families until the data justifies their choice. Here we are discussing Occam's razor for choice
of saddlepoints within a given family.

A final question that has been touched on, but not analyzed, in this paper is
the influence of the geometry of parameter manifolds on statistical inference. As
discussed in Section 3.1, terms of $O(1/N)$ and smaller in the asymptotic expansions
of the log Bayesian posterior and the log razor depend on details of the geometry
of the parameter manifold. It would be very interesting to understand the precise
meaning of this dependence.

6 Conclusion 

In this paper we have cast parametric model selection as a disordered statistical 
mechanics on the space of probability distributions. A low temperature expansion was 
used to develop the asymptotics of Bayesian methods beyond the analyses available 
in the literature and it was shown that Bayesian methods for model family inference 
embody Occam's razor. While reaching these results, we derived and discussed a 
novel interpretation of Jeffreys' prior density as the uniform prior on the probability 
distributions indexed by a parametric family. By considering the large N limit and the 
quenched approximation of the disordered system implemented by Bayesian inference, 
we derived the razor, a theoretical index of the complexity of a parametric family 
relative to a true distribution. Finally, in view of the analogue statistical mechanical 
interpretation, we discussed various interesting phenomena that should be present 
in systems that perform Bayesian learning. It is easy to create models that display 
these phenomena simply by considering families of distributions for which $D(t\|\Theta)$ has
the right structure. It would be interesting to examine models of known biophysical 
relevance to see if they exhibit such effects, so that experiments could be carried out 
to verify their presence or absence in the real world. In view of the length of the 
present paper, this project will be left to a future publication. 
A Measuring Indistinguishability of Distributions

Let us take $\Theta_p$ and $\Theta_q$ to be points on a parameter manifold. Since we are working
in the context of density estimation, a suitable measure of the distinguishability of
$\Theta_p$ and $\Theta_q$ should be derived by taking $N$ data points drawn from either $p$ or $q$ and
asking how well we can guess which distribution produced the data. If $p$ and $q$ do not
give very distinguishable distributions, they should not be counted separately since
that would count the same distribution twice.

Precisely this question of distinguishability is addressed in the classical theory of
hypothesis testing. Suppose $\{e_1 \ldots e_N\} \in E^N$ are drawn iid from one of $f_1$ and $f_2$ with
$D(f_1\|f_2) < \infty$. Let $A_N \subseteq E^N$ be the acceptance region for the hypothesis that the
distribution is $f_1$ and define the error probabilities $\alpha_N = f_1^N(A_N^C)$ and $\beta_N = f_2^N(A_N)$.
($A_N^C$ is the complement of $A_N$ in $E^N$ and $f^N$ denotes the product distribution on $E^N$
describing $N$ iid outcomes drawn from $f$.) In these definitions $\alpha_N$ is the probability
that $f_1$ was mistaken for $f_2$ and $\beta_N$ is the probability of the opposite error. Stein's
Lemma tells us how low we can make $\beta_N$ given a particular value of $\alpha_N$. Indeed, let
us define $\beta_N^\epsilon = \min_{A_N \subseteq E^N,\ \alpha_N \le \epsilon} \beta_N$. Then Stein's Lemma tells us that:

$$\lim_{\epsilon \to 0} \lim_{N \to \infty} \frac{1}{N} \ln \beta_N^\epsilon = -D(f_1\|f_2) \tag{14}$$
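
The statement of Stein's Lemma can be checked numerically in a simple case. The following sketch is an illustration, not part of the original argument: it assumes $f_1$ and $f_2$ are Bernoulli distributions with success probabilities `p1 > p2` (hypothetical values), and it uses a likelihood ratio threshold test on the number of successes rather than a search over all regions $A_N$, so the computed error is an upper bound on $\beta_N^\epsilon$ that has the same exponent.

```python
import math


def kl_bernoulli(p1, p2):
    """Relative entropy D(f1 || f2) between Bernoulli(p1) and Bernoulli(p2)."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))


def log_binom_pmf(k, n, p):
    """ln of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def log_type2_error(n, p1, p2, eps):
    """ln beta_N for the threshold test 'accept f1 if K >= k0' with alpha_N <= eps.

    Assumes p1 > p2, so that large counts favour f1.
    """
    # Grow the rejection region {K < k0} while its probability under f1 stays <= eps.
    alpha, k0 = 0.0, 0
    while k0 <= n:
        mass = math.exp(log_binom_pmf(k0, n, p1))
        if alpha + mass > eps:
            break
        alpha += mass
        k0 += 1
    # beta_N = P_{f2}(K >= k0), summed in log space to avoid underflow.
    logs = [log_binom_pmf(k, n, p2) for k in range(k0, n + 1)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


if __name__ == "__main__":
    p1, p2, eps = 0.5, 0.2, 0.05
    print("-D(f1||f2) =", -kl_bernoulli(p1, p2))
    for n in (200, 2000, 20000):
        print(n, log_type2_error(n, p1, p2, eps) / n)
```

The convergence of $(1/N)\ln\beta_N$ to $-D(f_1\|f_2)$ is slow, with a gap that shrinks roughly like $1/\sqrt{N}$ at fixed $\epsilon$, which is the role played by the $\delta_N$ terms in the bounds that follow.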

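By examining the proof of Stein's Lemma we find that for fixed $\epsilon$ and sufficiently
large $N$ the optimal choice of decision region places the following bound on $\beta_N^\epsilon$:

$$-D(f_1\|f_2) - \delta_N + \frac{\ln(1-\alpha_N)}{N} \le \frac{1}{N}\ln\beta_N^\epsilon \le -D(f_1\|f_2) + \delta_N + \frac{\ln(1-\alpha_N)}{N} \tag{15}$$

where $\alpha_N \le \epsilon$ for sufficiently large $N$. The $\delta_N$ are any sequence of positive constants
that satisfy the property that:

$$\alpha_N = f_1^N\left( \left| \frac{1}{N}\sum_{i=1}^N \ln\frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta_N \right) \le \epsilon \tag{16}$$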

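for all sufficiently large $N$. Now $(1/N)\sum_{i=1}^N \ln(f_1(e_i)/f_2(e_i))$ converges to $D(f_1\|f_2)$
by the law of large numbers since $D(f_1\|f_2) = E_{f_1}(\ln(f_1(e_i)/f_2(e_i)))$. So, for any fixed
$\delta > 0$ we have:

$$f_1^N\left( \left| \frac{1}{N}\sum_{i=1}^N \ln\frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta \right) < \epsilon \tag{17}$$

for all sufficiently large $N$. For a fixed $\epsilon$ and a fixed $N$ let $\Delta_{\epsilon N}$ be the collection of
$\delta > 0$ which satisfy this inequality. Let $\delta_{\epsilon N}$ be the infimum of the set $\Delta_{\epsilon N}$. Equation 17
guarantees that for any $\delta > 0$, for any sufficiently large $N$, $0 \le \delta_{\epsilon N} \le \delta$. We conclude
that $\delta_{\epsilon N}$ chosen in this way is a sequence that converges to zero as $N \to \infty$ while
satisfying the condition in Equation 16 which is necessary for proving Stein's Lemma.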



We will now apply these facts to the problem of distinguishability of points on a
parameter manifold. Let $\Theta_p$ and $\Theta_q$ index two distributions on a parameter manifold
and suppose that we are given $N$ outcomes generated independently from one of
them. We are interested in using Stein's Lemma to determine how distinguishable
$\Theta_p$ and $\Theta_q$ are. By Stein's Lemma:

$$-D(\Theta_p\|\Theta_q) - \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1-\alpha_N)}{N} \le \frac{\ln\beta_N^\epsilon(\Theta_q)}{N} \le -D(\Theta_p\|\Theta_q) + \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1-\alpha_N)}{N} \tag{18}$$
where we have written $\delta_{\epsilon N}(\Theta_q)$ and $\beta_N^\epsilon(\Theta_q)$ to emphasize that these quantities are
functions of $\Theta_q$ for a fixed $\Theta_p$. Let $A = -D(\Theta_p\|\Theta_q) + (1/N)\ln(1 - \alpha_N)$ be the average
of the upper and lower bounds in Equation 18. Then $A \ge -D(\Theta_p\|\Theta_q) + (1/N)\ln(1 - \epsilon)$
because the $\delta_{\epsilon N}(\Theta_q)$ have been chosen to satisfy Equation 16. We now define the set
of distributions $U_N = \{\Theta_q : -D(\Theta_p\|\Theta_q) + (1/N)\ln(1 - \epsilon) \ge (1/N)\ln\beta^*\}$ where
$1 > \beta^* > 0$ is some fixed constant. Note that as $N \to \infty$, $D(\Theta_p\|\Theta_q) \to 0$ for
$\Theta_q \in U_N$. We want to show that $U_N$ is a set of distributions which cannot be very
well distinguished from $\Theta_p$. The first way to see this is to observe that the average of
the upper and lower bounds on $\ln\beta_N^\epsilon$ is greater than or equal to $\ln\beta^*$ for $\Theta_q \in U_N$.
So, in this loose, average sense, the error probability exceeds $\beta^*$ for $\Theta_q \in U_N$.

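More carefully, note that $(1/N)\ln(1 - \alpha_N) \ge (1/N)\ln(1 - \epsilon)$ by choice of the
$\delta_{\epsilon N}(\Theta_q)$. So, using Equation 18 we see that $(1/N)\ln\beta_N^\epsilon(\Theta_q) \ge (1/N)\ln\beta^* - \delta_{\epsilon N}(\Theta_q)$
for $\Theta_q \in U_N$. Exponentiating this inequality we find that:

$$1 \ge [\beta_N^\epsilon(\Theta_q)]^{1/N} \ge (\beta^*)^{1/N}\, e^{-\delta_{\epsilon N}(\Theta_q)} \tag{19}$$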

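The significance of this expression is best understood by considering parametric
families in which, for every $\Theta_q$, $X_q(e_i) = \ln(\Theta_p(e_i)/\Theta_q(e_i))$ is a random variable with
finite mean and bounded variance, in the distribution indexed by $\Theta_p$. In that case,
taking $b$ to be the bound on the variances, Chebyshev's inequality says that:

$$\Theta_p^N\left( \left| \frac{1}{N}\sum_{i=1}^N X_q(e_i) - D(\Theta_p\|\Theta_q) \right| > \delta \right) \le \frac{\mathrm{Var}(X_q)}{\delta^2 N} \le \frac{b}{\delta^2 N} \tag{20}$$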

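In order to satisfy $\alpha_N \le \epsilon$ it suffices to choose $\delta = (b/N\epsilon)^{1/2}$. So, if the bounded
variance condition is satisfied, $\delta_{\epsilon N}(\Theta_q) \le (b/N\epsilon)^{1/2}$ for any $\Theta_q$ and therefore we have
the limit $\lim_{N \to \infty} \sup_{\Theta_q \in U_N} \delta_{\epsilon N}(\Theta_q) = 0$. Applying this limit to Equation 19 we find
that:

$$1 \ge \lim_{N \to \infty} \inf_{\Theta_q \in U_N} [\beta_N^\epsilon(\Theta_q)]^{1/N} \ge 1 \times \lim_{N \to \infty} \inf_{\Theta_q \in U_N} e^{-\delta_{\epsilon N}(\Theta_q)} = 1 \tag{21}$$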

In summary we find that $\lim_{N \to \infty} \inf_{\Theta_q \in U_N} [\beta_N^\epsilon(\Theta_q)]^{1/N} = 1$. This is to be contrasted
with the behaviour of $\beta_N^\epsilon(\Theta_q)$ for any fixed $\Theta_q \ne \Theta_p$, for which
$\lim_{N \to \infty} [\beta_N^\epsilon(\Theta_q)]^{1/N} = \exp(-D(\Theta_p\|\Theta_q)) < 1$. We have essentially shown that the sets
$U_N$ contain distributions that are not very distinguishable from $\Theta_p$. The smallest
one-sided error probability $\beta_N^\epsilon$ for distinguishing between $\Theta_p$ and $\Theta_q \in U_N$ remains
essentially constant, leading to the asymptotics in Equation 21.
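
The Chebyshev choice $\delta = (b/N\epsilon)^{1/2}$ used above can be checked by simulation. The sketch below is illustrative only and rests on assumptions not in the text: it takes $\Theta_p = N(0,1)$ and $\Theta_q = N(1,1)$, for which $X_q(e) = \ln(\Theta_p(e)/\Theta_q(e)) = \frac{1}{2} - e$ has mean $D(\Theta_p\|\Theta_q) = \frac{1}{2}$ and variance $b = 1$ under $\Theta_p$, and it estimates $\alpha_N$ by Monte Carlo with a fixed seed.

```python
import math
import random


def chebyshev_delta(b, n, eps):
    """Deviation delta = sqrt(b / (N eps)) that guarantees alpha_N <= eps."""
    return math.sqrt(b / (n * eps))


def empirical_alpha(n, delta, trials=2000, seed=0):
    """Monte Carlo estimate of P(|mean of X_q - D| > delta) under Theta_p = N(0, 1)."""
    rng = random.Random(seed)
    divergence = 0.5  # D(N(0,1) || N(1,1))
    misses = 0
    for _ in range(trials):
        # X_q(e) = ln(p(e) / q(e)) = 1/2 - e for these two Gaussians.
        mean_x = sum(0.5 - rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(mean_x - divergence) > delta:
            misses += 1
    return misses / trials


if __name__ == "__main__":
    eps = 0.05
    for n in (25, 100, 400):
        delta = chebyshev_delta(1.0, n, eps)
        print(n, delta, empirical_alpha(n, delta))
```

Chebyshev's inequality is far from tight for Gaussian data, so the empirical $\alpha_N$ is typically much smaller than $\epsilon$; the point of the check is only that the bound holds while $\delta$ shrinks like $N^{-1/2}$.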



Define $\kappa = -\ln\beta^* + \ln(1 - \epsilon)$ so that we can summarize the region $U_N$ of
high probability of error $\beta^*$ at fixed $\epsilon$ as $\kappa/N \ge D(\Theta_p\|\Theta_q)$. In this region, the
distributions are indistinguishable from $\Theta_p$ with error probabilities $\alpha_N \le \epsilon$ and

$$(\beta_N^\epsilon)^{1/N} \ge (\beta^*)^{1/N} \exp(-\delta_{\epsilon N}).$$
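
To give a sense of scale for the region $\kappa/N \ge D(\Theta_p\|\Theta_q)$, the sketch below evaluates it for an assumed example that does not appear in the paper: the unit-variance Gaussian location family, where $D(\Theta_p\|\Theta_q) = (\mu_p - \mu_q)^2/2$, so that the region is an interval of half-width $\sqrt{2\kappa/N}$ around $\mu_p$. The values of $\beta^*$ and $\epsilon$ are arbitrary illustrative choices with $\beta^* < 1 - \epsilon$, which keeps $\kappa$ positive.

```python
import math


def kappa(beta_star, eps):
    """kappa = -ln(beta*) + ln(1 - eps)."""
    return -math.log(beta_star) + math.log(1.0 - eps)


def indistinguishable_half_width(n, beta_star=0.1, eps=0.05):
    """Half-width in mu of the region D <= kappa / N, using D = (mu_p - mu_q)^2 / 2."""
    return math.sqrt(2.0 * kappa(beta_star, eps) / n)


def in_region(mu_p, mu_q, n, beta_star=0.1, eps=0.05):
    """True if Theta_q lies in the region kappa / N >= D(Theta_p || Theta_q)."""
    return 0.5 * (mu_p - mu_q) ** 2 <= kappa(beta_star, eps) / n


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, indistinguishable_half_width(n))
```

The half-width shrinks like $N^{-1/2}$: quadrupling the number of data points halves the range of means that cannot be told apart from $\mu_p$ at these error levels.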